This notebook is provided as a case study of blockchain-based financial assets, in particular the Bitcoin and Ethereum cryptocurrencies. The data in this notebook was extracted from the Kraken exchange using the ccxt library.
To run this notebook, you need to install the dependencies listed in the requirements.txt file:
%%capture
# Install all the pip packages in the requirements.txt
import sys
!{sys.executable} -m pip install -r requirements.txt
# -- Generic
import pandas as pd
import numpy as np
import json
# -- Visualization
import plotly.graph_objects as go
# -- Project scripts
import visualizations as vs
import functions as fn
Historical orderbook data is stored in a JSON (JavaScript Object Notation) formatted file. In practical terms, it is very similar to a Python dictionary: it stores data in a nested, list-like structure with one or several keys, each of which can hold more keys and/or data points.
Remember also that a pandas DataFrame is, at its core, a dictionary where each column name is a key and the column's contents are that key's values. So, in a way, a JSON file is one option for representing a multidimensional DataFrame.
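As a quick illustration of that analogy, a dictionary of columns converts directly into a DataFrame (the prices below are illustrative, not taken from the data file):

```python
import pandas as pd

# Each key becomes a column name, each value becomes the column's contents
tob = {'bid': [28298.3, 28293.9], 'ask': [28298.4, 28300.0]}
df = pd.DataFrame(tob)
```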
Reading the contents of the file is very simple with the json library; it is part of the Python standard library, so it does not need to be installed, only imported.
# Open the JSON file and parse its contents into a dictionary
with open('files/orderbooks_05jul21.json') as f:
    orderbooks_data = json.load(f)
Let's check the contents of the file.
orderbooks_data.keys()
dict_keys(['bitfinex', 'kraken'])
All right, we have 2 keys, each corresponding to a different exchange where BTC/USDT is listed. Let's store the timestamps of the two in separate objects; we will need them for later calculations.
bitfinex_ts = list(orderbooks_data['bitfinex'].keys())
kraken_ts = list(orderbooks_data['kraken'].keys())
Now let's look at some other info for each of the lists. In principle we are interested in just one for now, so this calculation can apply to either of the two. Let's use the data we have from Bitfinex as an example.
Inside the functions file, create a function named f_compare_ts with input parameters ts_list_o and ts_list_d. The output must be a dict with the keys first_o, last_o and qty_o (first timestamp, last timestamp and orderbook count for the origin list), first_d, last_d and qty_d (the same for the destination list), and exact_match (itself a dict with the keys qty and values).
Place the code for this function in the functions.py file; it should be callable from the following cell.
q1_results = fn.f_compare_ts(ts_list_o=bitfinex_ts, ts_list_d=kraken_ts)
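For reference, a minimal sketch of what f_compare_ts could look like, assuming ISO-8601 timestamps such as '2021-07-05T13:06:46.571Z' (the graded solution belongs in functions.py):

```python
from datetime import datetime

def f_compare_ts(ts_list_o, ts_list_d):
    # Parse/format ISO-8601 timestamps with milliseconds and a 'Z' suffix
    fmt = '%Y-%m-%dT%H:%M:%S.%fZ'
    parse = lambda ts: datetime.strptime(ts, fmt)
    # Exact matches: timestamps present in both lists, kept as datetime objects
    match = sorted(parse(ts) for ts in set(ts_list_o) & set(ts_list_d))
    return {'first_o': parse(ts_list_o[0]).strftime(fmt),
            'last_o': parse(ts_list_o[-1]).strftime(fmt),
            'qty_o': len(ts_list_o),
            'first_d': parse(ts_list_d[0]).strftime(fmt),
            'last_d': parse(ts_list_d[-1]).strftime(fmt),
            'qty_d': len(ts_list_d),
            'exact_match': {'qty': len(match), 'values': match}}
```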
To practice the use of the display function in Jupyter notebooks, you should deliver exactly the same result:
# Delete this comment and place your code here in this cell
display('Timestamps in Origin data:')
display('First Timestamp: ' + q1_results["first_o"])
display('Last Timestamp: ' + q1_results["last_o"])
display('Total number of orderbooks: ' + str(q1_results["qty_o"]))
display('Timestamps in Destination data:')
display('First Timestamp: ' + q1_results["first_d"])
display('Last Timestamp: ' + q1_results["last_d"])
display('Total number of orderbooks: ' + str(q1_results["qty_d"]))
display('Exact match of Timestamps: ' + str(q1_results["exact_match"]["qty"]))
display('First 2 values are: ')
display([q1_results["exact_match"]["values"][0].strftime("%Y-%m-%dT%H:%M:%S.%fZ"),q1_results["exact_match"]["values"][1].strftime("%Y-%m-%dT%H:%M:%S.%fZ")])
display('Last 2 values are: ')
display([q1_results["exact_match"]["values"][-1].strftime("%Y-%m-%dT%H:%M:%S.%fZ"),q1_results["exact_match"]["values"][-2].strftime("%Y-%m-%dT%H:%M:%S.%fZ")])
'Timestamps in Origin data:'
'First Timestamp: 2021-07-05T13:06:46.571000Z'
'Last Timestamp: 2021-07-05T14:06:49.417000Z'
'Total number of orderbooks: 3602'
'Timestamps in Destination data:'
'First Timestamp: 2021-07-05T13:06:46.571000Z'
'Last Timestamp: 2021-07-05T14:06:49.417000Z'
'Total number of orderbooks: 3602'
'Exact match of Timestamps: 3602'
'First 2 values are: '
['2021-07-05T13:06:46.571000Z', '2021-07-05T13:06:47.918000Z']
'Last 2 values are: '
['2021-07-05T14:06:49.417000Z', '2021-07-05T14:06:46.749000Z']
display(bitfinex_ts[0:4])
display(kraken_ts[0:4])
['2021-07-05T13:06:46.571Z', '2021-07-05T13:06:47.918Z', '2021-07-05T13:06:49.414Z', '2021-07-05T13:06:51.077Z']
['2021-07-05T13:06:46.571Z', '2021-07-05T13:06:47.918Z', '2021-07-05T13:06:49.414Z', '2021-07-05T13:06:51.077Z']
From now on, for the purpose of this lab, we are going to use the information of only one exchange, and we will use the following column order (just a convention): 'bid_size', 'bid', 'ask' and 'ask_size'.
ob_data = orderbooks_data['kraken']
There is a possibility of having None or empty values for any timestamp contained in the data set, so for the next sections of this lab we need to use only the keys of the dict with non-None values.
Next, include your code that generates the output shown.
ob_data_clean = {k: v for k, v in ob_data.items() if v is not None}
display('Number of historical Kraken orderbooks:')
display('Before dropping Nones: ' + str(len(ob_data.keys())))
display('After dropping Nones: ' + str(len(ob_data_clean.keys())))
'Number of historical Kraken orderbooks:'
'Before dropping Nones: 3602'
'After dropping Nones: 1201'
Let's see the first timestamp from which we have data.
ob_data = ob_data_clean.copy()
display('First timestamp: ' + list(ob_data.keys())[0])
print('')
display('The Orderbook data is the following:')
print('')
display(pd.DataFrame(ob_data[list(ob_data.keys())[0]]))
'First timestamp: 2021-07-05T13:06:49.495Z'
'The Orderbook data is the following:'
|  | ask_size | ask | bid | bid_size |
|---|---|---|---|---|
| 0 | 3.352 | 28298.4 | 28298.3 | 0.880 |
| 1 | 0.020 | 28300.0 | 28293.9 | 0.350 |
| 2 | 0.074 | 28305.4 | 28293.5 | 0.377 |
| 3 | 0.123 | 28305.5 | 28292.2 | 0.011 |
| 4 | 0.204 | 28307.3 | 28292.0 | 1.221 |
| ... | ... | ... | ... | ... |
| 95 | 0.001 | 28430.4 | 28179.1 | 3.726 |
| 96 | 0.001 | 28431.0 | 28178.9 | 3.059 |
| 97 | 3.840 | 28431.8 | 28169.6 | 3.733 |
| 98 | 3.032 | 28432.1 | 28168.2 | 2.756 |
| 99 | 4.470 | 28446.7 | 28167.8 | 2.498 |
100 rows × 4 columns
... And what about the last timestamp we have data from?
display('Last timestamp: ' + list(ob_data.keys())[-1])
print('')
display('The Orderbook data is the following:')
print('')
display(pd.DataFrame(ob_data[list(ob_data.keys())[-1]]))
'Last timestamp: 2021-07-05T14:06:49.417Z'
'The Orderbook data is the following:'
|  | ask_size | ask | bid | bid_size |
|---|---|---|---|---|
| 0 | 0.810 | 28396.9 | 28396.8 | 0.001 |
| 1 | 0.743 | 28397.9 | 28389.7 | 0.016 |
| 2 | 0.005 | 28400.9 | 28389.3 | 0.165 |
| 3 | 0.190 | 28402.7 | 28388.1 | 0.001 |
| 4 | 0.002 | 28402.9 | 28386.8 | 0.550 |
| ... | ... | ... | ... | ... |
| 95 | 0.113 | 28512.9 | 28286.4 | 3.886 |
| 96 | 0.063 | 28513.1 | 28282.4 | 9.627 |
| 97 | 3.117 | 28514.2 | 28279.9 | 3.693 |
| 98 | 3.000 | 28517.0 | 28279.8 | 0.001 |
| 99 | 0.001 | 28517.6 | 28279.6 | 4.396 |
100 rows × 4 columns
All right, looks good. We just need to reorder the columns into the order more often used in industry. After that, we should make a plot to get a different visual representation of the data; let's use Plotly.
ob_data = {orderbook: pd.DataFrame(ob_data[orderbook])[['bid_size', 'bid', 'ask', 'ask_size']]
for orderbook in list(ob_data.keys())}
ob_data[list(ob_data.keys())[0]]
|  | bid_size | bid | ask | ask_size |
|---|---|---|---|---|
| 0 | 0.880 | 28298.3 | 28298.4 | 3.352 |
| 1 | 0.350 | 28293.9 | 28300.0 | 0.020 |
| 2 | 0.377 | 28293.5 | 28305.4 | 0.074 |
| 3 | 0.011 | 28292.2 | 28305.5 | 0.123 |
| 4 | 1.221 | 28292.0 | 28307.3 | 0.204 |
| ... | ... | ... | ... | ... |
| 95 | 3.726 | 28179.1 | 28430.4 | 0.001 |
| 96 | 3.059 | 28178.9 | 28431.0 | 0.001 |
| 97 | 3.733 | 28169.6 | 28431.8 | 3.840 |
| 98 | 2.756 | 28168.2 | 28432.1 | 3.032 |
| 99 | 2.498 | 28167.8 | 28446.7 | 4.470 |
100 rows × 4 columns
Inside the visualizations.py file there is a function named plot_orderbook with input parameter ob_data_plot. Using that function, you need to first select the timestamp that had the highest total volume in the orderbook, and then use it to plot the first 20 levels of the Bid side and the first 20 levels of the Ask side.
Note: If you find more than 1 timestamp with the exact same volume, use the one that is closest to the present.
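One way to pick that timestamp could look like the sketch below (max_volume_timestamp is a hypothetical helper name, not part of visualizations.py; it assumes ob_data maps timestamps to DataFrames with 'bid_size' and 'ask_size' columns):

```python
import pandas as pd

def max_volume_timestamp(ob_data):
    # Total volume per timestamp: sum of all bid sizes plus all ask sizes
    volumes = {ts: ob['bid_size'].sum() + ob['ask_size'].sum()
               for ts, ob in ob_data.items()}
    max_vol = max(volumes.values())
    # On ties, keep the timestamp closest to the present; ISO-8601 strings
    # sort chronologically, so max() returns the latest one
    return max(ts for ts, vol in volumes.items() if vol == max_vol)
```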
vs.plot_orderbook(ob_data)
Now we need to calculate the historical bid, ask and spread. That means calculating them using the Top of the Book, i.e. using only the best bid and the best ask. So, it turns out we need to iterate over all the orderbook data we have and calculate those values.
So, create a DataFrame, df_ts_tob with the following contents:
df_ts_tob = fn.historical_spread(ob_data)
df_ts_tob
|  | timestamp | bid | ask | spread |
|---|---|---|---|---|
| 0 | 2021-07-05T13:06:49.495Z | 28298.3 | 28298.4 | 0.1 |
| 1 | 2021-07-05T13:06:52.415Z | 28298.3 | 28298.4 | 0.1 |
| 2 | 2021-07-05T13:06:55.500Z | 28309.9 | 28310.0 | 0.1 |
| 3 | 2021-07-05T13:06:58.414Z | 28309.9 | 28310.0 | 0.1 |
| 4 | 2021-07-05T13:07:01.503Z | 28319.9 | 28320.0 | 0.1 |
| ... | ... | ... | ... | ... |
| 1196 | 2021-07-05T14:06:37.413Z | 28402.8 | 28402.9 | 0.1 |
| 1197 | 2021-07-05T14:06:40.499Z | 28396.8 | 28396.9 | 0.1 |
| 1198 | 2021-07-05T14:06:43.413Z | 28396.8 | 28396.9 | 0.1 |
| 1199 | 2021-07-05T14:06:46.749Z | 28396.8 | 28396.9 | 0.1 |
| 1200 | 2021-07-05T14:06:49.417Z | 28396.8 | 28396.9 | 0.1 |
1201 rows × 4 columns
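A possible sketch of historical_spread that would produce a frame like the one above (assuming ob_data maps timestamps to DataFrames whose first row is the top of the book; the graded version belongs in functions.py):

```python
import pandas as pd

def historical_spread(ob_data):
    rows = []
    for ts, ob in ob_data.items():
        best_bid = ob['bid'].iloc[0]   # row 0 holds the best bid and ask
        best_ask = ob['ask'].iloc[0]
        rows.append({'timestamp': ts, 'bid': best_bid, 'ask': best_ask,
                     # rounded to tame floating-point noise in the difference
                     'spread': round(best_ask - best_bid, 2)})
    return pd.DataFrame(rows)
```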
In order to see some interesting time periods, let's see if we can find the timestamps where the spread was above 8.0 USD.
df_ts_tob.query("spread > 8.0")
|  | timestamp | bid | ask | spread |
|---|---|---|---|---|
| 32 | 2021-07-05T13:08:25.495Z | 28321.6 | 28332.3 | 10.7 |
| 33 | 2021-07-05T13:08:28.414Z | 28321.6 | 28332.3 | 10.7 |
| 36 | 2021-07-05T13:08:37.500Z | 28326.5 | 28334.6 | 8.1 |
| 37 | 2021-07-05T13:08:40.443Z | 28326.5 | 28334.6 | 8.1 |
| 122 | 2021-07-05T13:12:55.501Z | 28403.1 | 28414.6 | 11.5 |
| 123 | 2021-07-05T13:12:58.421Z | 28403.1 | 28414.6 | 11.5 |
| 124 | 2021-07-05T13:13:01.502Z | 28403.0 | 28414.6 | 11.6 |
| 130 | 2021-07-05T13:13:19.660Z | 28399.5 | 28409.7 | 10.2 |
| 131 | 2021-07-05T13:13:22.421Z | 28399.5 | 28409.7 | 10.2 |
| 183 | 2021-07-05T13:15:58.501Z | 28405.4 | 28413.8 | 8.4 |
| 184 | 2021-07-05T13:16:01.505Z | 28404.3 | 28412.4 | 8.1 |
| 185 | 2021-07-05T13:16:04.416Z | 28404.3 | 28412.4 | 8.1 |
| 223 | 2021-07-05T13:17:58.494Z | 28366.8 | 28375.7 | 8.9 |
| 224 | 2021-07-05T13:18:01.423Z | 28366.8 | 28375.7 | 8.9 |
| 246 | 2021-07-05T13:19:07.500Z | 28340.7 | 28349.2 | 8.5 |
| 437 | 2021-07-05T13:28:40.502Z | 28361.8 | 28370.3 | 8.5 |
| 478 | 2021-07-05T13:30:43.497Z | 28386.4 | 28396.1 | 9.7 |
| 479 | 2021-07-05T13:30:46.415Z | 28386.4 | 28396.1 | 9.7 |
| 662 | 2021-07-05T13:39:55.501Z | 28372.4 | 28381.3 | 8.9 |
| 663 | 2021-07-05T13:39:58.431Z | 28372.4 | 28381.3 | 8.9 |
| 877 | 2021-07-05T13:50:40.682Z | 28431.0 | 28439.6 | 8.6 |
| 878 | 2021-07-05T13:50:43.414Z | 28431.0 | 28439.6 | 8.6 |
Let's visualize how the bid and the ask have behaved over the whole time series we have. You have to create a time series plot, using Plotly, exactly like this one:
vs.plot_line_ts(df_ts_tob)
We will need to see the spread in time batches, so let's see if you can split the data into minutes and create a boxplot for every group, like the following:
df_ts_tob['spread'].describe()
count    1201.000000
mean        1.125062
std         2.181146
min         0.100000
25%         0.100000
50%         0.100000
75%         0.100000
max        11.600000
Name: spread, dtype: float64
Hint: You will need to have the timestamp converted to minutes only, and use these two parameters when calling the Plotly function:
fig, spread_data = vs.plot_bar_ts(df_ts_tob)
fig.update_layout(title='SPREAD BOXPLOT')
fig.show()
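The per-minute grouping behind that boxplot can be sketched with pandas alone (df below is a tiny illustrative frame with the same columns as df_ts_tob; plot_bar_ts in visualizations.py presumably does something similar before drawing one box per group):

```python
import pandas as pd

# Tiny illustrative frame with the same columns as df_ts_tob
df = pd.DataFrame({'timestamp': ['2021-07-05T13:06:49.495Z',
                                 '2021-07-05T13:06:52.415Z',
                                 '2021-07-05T13:07:01.503Z'],
                   'spread': [0.1, 0.1, 0.2]})
# Truncate each timestamp to its minute, then collect the spreads per minute
df['minute'] = pd.to_datetime(df['timestamp']).dt.strftime('%H:%M')
spread_groups = df.groupby('minute')['spread'].apply(list)
```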
For the following calculations you can use your own approach and commands to write the code; still, there are some suggestions for using numpy arrays. Feel free to focus on the levels of engineering development as mentioned in class, with this priority: 1) It works, 2) It's efficient, 3) It's elegant.
Orderbook Imbalance
Use only the Top of the Book (TOB) information, which means only the best Bid and Ask volumes; or, more generally, the volumes accumulated from the first level down to a chosen depth on each side.
# OrderBook Imbalance (v: volume, d: depth)
# v[0] Bid volume, v[1] Ask volume
imbalance = lambda v, d : np.sum(v[0][:d])/np.sum([v[0][:d], v[1][:d]])
# Delete this comment and place your code here in this cell
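As an example of how the imbalance lambda can be used at depths 1 and 2 (the volumes below are illustrative; the lambda is repeated so the cell is self-contained):

```python
import numpy as np

# Same imbalance definition as above (v: volume, d: depth)
imbalance = lambda v, d: np.sum(v[0][:d]) / np.sum([v[0][:d], v[1][:d]])

# v[0]: bid sizes, v[1]: ask sizes (illustrative values)
v = [np.array([1.0, 0.5]), np.array([3.0, 0.5])]
tob_imbalance = imbalance(v, 1)    # 1.0 / (1.0 + 3.0) = 0.25
depth_imbalance = imbalance(v, 2)  # 1.5 / (1.5 + 3.5) = 0.30
```

Values near 0.5 indicate a balanced book; values near 1.0 indicate the volume is concentrated on the bid side.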
Spread
Also known as Quote Spread, the most commonly used spread definition and the one we use predominantly in our systems.
# OrderBook Spread (p: price)
# p[1] Ask price, p[0] Bid price
spread = lambda p: (p[1] - p[0])
# Delete this comment and place your code here in this cell
Midprice
Also known as Arithmetic Midprice, the most commonly used formula and the one that is present in our systems.
# Midprice (p: price)
# p[0]: Bid price, p[1]: Ask price
midprice = lambda p: (p[0] + p[1])*0.5
# Delete this comment and place your code here in this cell
Weighted-Midprice
It uses only the TOB prices and their corresponding volumes. It represents the weighted average of the Bid (P_b) and the Ask (P_a), weighted by their relative volumes V_b and V_a:

W = (V_a / (V_b + V_a)) * P_b + (V_b / (V_b + V_a)) * P_a

If there are a lot of Buyers (a large V_b), then the Weighted-Midprice is pushed towards P_a (since V_b multiplies P_a and also appears in the divisor); the opposite effect holds when there are a lot of Sellers. This indicates the Buy or Sell pressure in the market, and thus the tendency of the price to move towards either the Bid or the Ask side.
# Delete this comment and place your code here in this cell
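A sketch of the Weighted-Midprice under that description (this follows the common convention where the bid volume weights the ask price; confirm it against the formula given in class):

```python
# Weighted-Midprice (p: price, v: volume)
# p[0]: Bid price, p[1]: Ask price, v[0]: Bid volume, v[1]: Ask volume
w_midprice = lambda p, v: (v[1] / (v[0] + v[1])) * p[0] + (v[0] / (v[0] + v[1])) * p[1]
```

For example, with p = [100.0, 102.0] and v = [3.0, 1.0] (three times more volume on the bid), the result is 101.5, pushed towards the ask.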
[1] Muñoz, 2020. Python project template. https://github.com/iffranciscome/python-project (2021).